Loading data files from Amazon S3

You can load data files from Amazon S3 object storage into a Spark DataFrame and explore the data with PixieDust.

Prerequisites

  • Collect your Amazon S3 connection information:

    • access key
    • secret key
    • bucket name
    • file name
  • Import PixieDust and enable the Spark Job monitor


In [ ]:
import pixiedust
pixiedust.enableJobMonitor()

Configure Amazon S3 connectivity

Customize this cell with your S3 connection information


In [ ]:
# @hidden_cell
# Enter your S3 access key (e.g. 'A....K')
s3_access_key = '...'
# Enter your S3 secret key (e.g. 'S....K')
s3_secret_key = '...'
# Enter your S3 bucket name (e.g. 'my-source-bucket')
s3_bucket = '...'
# Enter your CSV file name (e.g. 'my-data/my-file.csv' if my-file.csv is located in the folder my-data)
s3_file_name = '....csv'
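
Optionally, you can read the credentials from environment variables instead of hardcoding them in the notebook. This is a minimal sketch, assuming the conventional variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY have been set in your environment:


In [ ]:
import os
# Fall back to the values entered above if the environment variables are not set.
# The variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are a common
# convention, not a requirement of this notebook.
s3_access_key = os.environ.get('AWS_ACCESS_KEY_ID', s3_access_key)
s3_secret_key = os.environ.get('AWS_SECRET_ACCESS_KEY', s3_secret_key)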

Load CSV data

Load the CSV file from Amazon S3 into a Spark DataFrame.


In [4]:
# no changes are required to this cell
from ingest import Connectors
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

S3loadoptions = { 
                  Connectors.AmazonS3.ACCESS_KEY          : s3_access_key,
                  Connectors.AmazonS3.SECRET_KEY          : s3_secret_key,
                  Connectors.AmazonS3.SOURCE_BUCKET       : s3_bucket,
                  Connectors.AmazonS3.SOURCE_FILE_NAME    : s3_file_name,
                  Connectors.AmazonS3.SOURCE_INFER_SCHEMA : '1',
                  Connectors.AmazonS3.SOURCE_FILE_FORMAT  : 'csv'}


S3_data = sqlContext.read.format('com.ibm.spark.discover').options(**S3loadoptions).load()
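
Before exploring the data, you can run a quick sanity check to confirm that the schema was inferred as expected. This is a minimal sketch using standard Spark DataFrame methods:


In [ ]:
# Print the inferred schema and preview the first few rows
S3_data.printSchema()
S3_data.show(5)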

Explore the loaded data using PixieDust


In [5]:
display(S3_data)


For information on how to load data from other sources, refer to [these code snippets](https://apsportal.ibm.com/docs/content/analyze-data/python_load.html).
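
If the ingest Connectors library is not available in your environment, a similar load can be expressed with the generic Spark CSV reader. This is a sketch only, assuming Spark 2.x, the Hadoop S3A connector, and a CSV file with a header row; fs.s3a.access.key and fs.s3a.secret.key are standard Hadoop configuration settings:


In [ ]:
# Pass the S3 credentials to the Hadoop S3A connector
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', s3_access_key)
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', s3_secret_key)

# Read the CSV file directly via the s3a:// scheme
alt_df = (sqlContext.read
          .format('csv')
          .option('header', 'true')
          .option('inferSchema', 'true')
          .load('s3a://{}/{}'.format(s3_bucket, s3_file_name)))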
